Prepare the data.
The data used in this exercise was generated by performing whole genome sequencing analysis of metastatic prostate tumors. This assay allows us to identify structural variations in tumor genomes such as duplications, inversions, and large deletions.
For more details on the analysis and results, see Quigley et al. Cell 2018.
We counted the number of structural variations in each tumor sample. For our purposes, variants come in five types: inversions, deletions, duplications, insertions, and translocations. We also assessed each biopsy to identify biallelic inactivating mutations in each gene. Read in summaries of the number of structural variants present in each tumor and the presence/absence of mutations for a few selected genes.
sv = read.table("SV_summary_table.txt", sep='\t', header=TRUE, stringsAsFactors=FALSE)
mut = read.table("WCDT_mutations.txt", sep='\t', header=TRUE, stringsAsFactors=FALSE)
suppressPackageStartupMessages({
library(ggplot2)
library(reshape2)
library(dplyr)
library(tidyr)
library(ggpubr)
library(purrr)
library(RColorBrewer)
library(plotly)
})
The sv object is a data frame reporting counts of five types of structural variant (SV) in 101 patient tumor biopsies.
QUESTION 1.1: Plot a summary of the distributions for each type of SV individually. Do any of the distributions have outliers, defined as “values that exceed the whiskers of a boxplot”?
sv_pivot = sv %>%
mutate(id = rownames(sv)) %>%
pivot_longer(-id, names_to = "variant_type", values_to = "count")
ggplot(data = sv_pivot, aes(x = variant_type, y = count)) +
geom_boxplot(outlier.shape = NA) +
geom_point(position = position_dodge2(0.5), size = 2, shape = 21, color = "black", fill = "gray") +
xlab("Variant Type") +
ylab("Counts") +
theme_pubr() +
theme(text = element_text(size = 15))
Most of the distributions plotted above have outliers by this definition. The distribution of insertions is much less variable than the distributions of deletions and translocations.
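The whisker definition can also be checked programmatically with `boxplot.stats`, which returns the points beyond the 1.5 × IQR whiskers. A minimal sketch on a synthetic vector; the same helper could be applied to each column of `sv`:

```r
# Count values falling outside the boxplot whiskers (1.5 * IQR rule)
n_outliers = function(x) length(boxplot.stats(x)$out)

# Synthetic example: twenty identical counts plus one extreme value
counts = c(rep(5, 20), 50)
n_outliers(counts)  # returns 1: only the extreme value is past the whiskers

# Applied to the real data: sapply(sv, n_outliers)
```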
QUESTION 1.2 Which SV types are most frequent? Least frequent?
sv_pivot %>%
split(.$variant_type) %>%
map(function(x){
summary(x$count)
})
## $deletions
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 25.0 59.0 80.0 100.9 116.0 455.0
##
## $duplications
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 10.00 23.00 31.00 50.49 50.00 393.00
##
## $insertions
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 3.000 6.000 8.693 12.000 54.000
##
## $inversions
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 12.00 48.00 69.00 79.62 91.00 387.00
##
## $translocations
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 19.00 55.00 83.00 97.53 120.00 410.00
Based on the summary statistics, deletions have the highest mean count, with translocations a close second (and a slightly higher median); without a formal test we cannot say this difference is significant. Insertions are clearly the least frequent SV type.
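Whether deletions really outnumber translocations within the same tumors could be checked with a paired test, e.g. `wilcox.test(sv$deletions, sv$translocations, paired = TRUE)`. A sketch on synthetic counts; the Poisson means are made up, chosen near the observed averages:

```r
set.seed(1)
# Hypothetical per-sample counts for two SV types in 101 tumors
del = rpois(101, lambda = 100)
tra = rpois(101, lambda = 97)
# Paired Wilcoxon signed-rank test: is one type systematically
# higher within the same samples?
wilcox.test(del, tra, paired = TRUE)$p.value
```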
Some analyses are contingent on distributional assumptions. For example, they may assume values are normally distributed. We can test the assumption that a sample is normally distributed with a QQ plot:
set.seed(124)
x=rnorm(100)
qqnorm( x )
qqline( x )
QUESTION 2.1 Generate a QQ plot to evaluate the assumption that deletions are normally distributed. Are deletions normally distributed?
deletion = sv$deletions
qqnorm(deletion)
qqline(deletion)
As the figure shows, the points deviate markedly from the QQ line, so deletion counts are not normally distributed.
Data that are not normally distributed can be coerced towards a normal distribution by transforming the data. How could you transform the distribution of deletions so that it’s closer to normal?
QUESTION 3.1 Replot the QQ plot after performing a log transformation to see what effect your transformation had. Did this transformation make the sample more similar to a normal distribution?
Note: leave the data as their original counts for the rest of the questions. Just transform for this question.
log_deletion = log(deletion, base = 10)
qqnorm(log_deletion)
qqline(log_deletion)
A log transformation makes the data look more similar to a normal distribution.
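A formal check such as the Shapiro-Wilk test can complement the QQ plots. This is a sketch on synthetic data, since log-normal counts behave like the raw deletion counts:

```r
set.seed(124)
x = rlnorm(100)                 # right-skewed, like the raw counts
shapiro.test(x)$p.value         # very small: normality rejected
shapiro.test(log10(x))$p.value  # log10(x) is normal by construction
```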
Structural variations arise from DNA damage that is not repaired. By analyzing tumor genomes, we can figure out the kind of DNA damage that occurred by studying the patterns of SVs.
The null model is that there is no relationship between the number of any type of SV. Alternatively, there might be an association between some of the SV types, suggesting something in common about their etiology.
QUESTION 4.1 Calculate pairwise correlation between all five types of SV and plot the resulting correlation coefficients as a heat map. Try both Pearson correlation (the parametric default method in R) and non-parametric Spearman rank correlation, to see if it matters.
Hint for plotting a simple correlation heatmap, where the matrix X contains the values to plot in the heatmap:
correlation_df = cor(sv, method = "pearson")
correlation_df = melt(correlation_df)
correlation_df = correlation_df %>%
mutate(label1 = substr(as.character(Var1), 1, 3),
label2 = substr(as.character(Var2), 1, 3),
label = paste(label1, label2, sep = " ~ "))
ggplot(correlation_df, aes(x = Var1, y = Var2)) +
geom_tile(aes(fill = value)) +
geom_text(aes(label = label)) +
labs(fill = "Pearson Corr") +
scale_fill_gradient(low = "#345082", high = "#fc8d62") +
xlab("") +
ylab("") +
theme_minimal() +
theme(text = element_text(size = 15),
axis.text.x = element_text(angle = 45, hjust = 1))
correlation_df = cor(sv, method = "spearman")
correlation_df = melt(correlation_df)
correlation_df = correlation_df %>%
mutate(label1 = substr(as.character(Var1), 1, 3),
label2 = substr(as.character(Var2), 1, 3),
label = paste(label1, label2, sep = " ~ "))
ggplot(correlation_df, aes(x = Var1, y = Var2)) +
geom_tile(aes(fill = value)) +
geom_text(aes(label = label)) +
labs(fill = "Spearman Corr") +
scale_fill_gradient(low = "#345082", high = "#fc8d62") +
xlab("") +
ylab("") +
theme_minimal() +
theme(text = element_text(size = 15),
axis.text.x = element_text(angle = 45, hjust = 1))
The Pearson and Spearman correlations between the SV types are shown as heatmaps above, with orange indicating stronger correlation and blue indicating weaker correlation.
QUESTION 4.2 Which types of SV are most strongly correlated with each other? Without formal testing, do the correlation data support the null model, or is there reason to investigate an alternative model? Which pairs of SV are most likely to occur in similar counts?
correlation_pearson = cor(sv, method = "pearson")
correlation_pearson[upper.tri(correlation_pearson)] = NA
correlation_pearson = correlation_pearson %>%
melt() %>%
arrange(desc(value)) %>%
filter(Var1 != Var2) %>%
filter(!is.na(value))
correlation_spearman = cor(sv, method = "spearman")
correlation_spearman[upper.tri(correlation_spearman)] = NA
correlation_spearman = correlation_spearman %>%
melt() %>%
arrange(desc(value)) %>%
filter(Var1 != Var2) %>%
filter(!is.na(value))
correlation_pearson
## Var1 Var2 value
## 1 duplications inversions 0.40137286
## 2 insertions translocations 0.40034975
## 3 inversions deletions 0.32947337
## 4 duplications translocations 0.29742552
## 5 inversions translocations 0.26117750
## 6 insertions deletions 0.15213080
## 7 insertions duplications 0.15031741
## 8 insertions inversions 0.12568566
## 9 translocations deletions 0.11703097
## 10 duplications deletions 0.05549646
correlation_spearman
## Var1 Var2 value
## 1 duplications inversions 0.6605935
## 2 inversions deletions 0.5757338
## 3 duplications deletions 0.5045968
## 4 insertions translocations 0.4686130
## 5 inversions translocations 0.3441440
## 6 translocations deletions 0.2786888
## 7 duplications translocations 0.2704208
## 8 insertions deletions 0.2560034
## 9 insertions inversions 0.2337908
## 10 insertions duplications 0.1750751
Based on the heatmaps and the ranked tables above, counts of duplications and inversions are the most strongly correlated pair under both methods, with insertions and translocations (Pearson) and inversions and deletions (Spearman) close behind. These correlations are not consistent with the null model of no relationship, so there is reason to investigate an alternative model. Duplications and inversions are most likely to occur in similar counts.
QUESTION 4.3 (BONUS): Why are the correlation values so different when comparing Spearman rank correlation to Pearson correlation? What might be driving these differences? Does it matter?
correlation_pearson = correlation_pearson %>%
mutate(corr = paste(Var1, Var2, sep = "~")) %>%
mutate(pearson = value) %>%
select(pearson, corr)
correlation_spearman = correlation_spearman %>%
mutate(corr = paste(Var1, Var2, sep = "~")) %>%
mutate(spearman = value) %>%
select(spearman, corr)
ps_comp = full_join(correlation_pearson, correlation_spearman, by = "corr") %>%
select(corr, pearson, spearman)
ggplot(data = ps_comp, aes(x = pearson, spearman)) +
geom_point(aes(fill = corr), size = 4, color = "black", shape = 21) +
labs(fill = "") +
xlab("Pearson Correlation") +
ylab("Spearman Correlation") +
theme_pubr() +
theme(text = element_text(size = 15), legend.position = "right")
The correlation values differ because Spearman rank correlation is a non-parametric method that uses the ranks of the counts rather than their absolute values. Given the lack of normality in these data, the parametric and non-parametric methods are likely to disagree. In particular, outliers are probably driving the differences, since they have a much larger impact on Pearson correlation than on Spearman correlation.
ggplot(data = sv, aes(x = duplications, y = deletions)) +
geom_point(fill = "gray", size = 3, color = "black", shape = 21) +
xlab("Duplications") +
ylab("Deletions") +
theme_pubr() +
theme(text = element_text(size = 15))
These differences matter most when both variables have extreme outliers. Judging by the gap between its Spearman and Pearson values, the correlation between deletions and duplications best exemplifies the difference between the two methods: both variables contain extreme values that disagree with the overall trend of the dataset, and these outliers affect the Pearson correlation far more than the Spearman correlation.
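The leverage of a single outlier can be demonstrated directly with synthetic data: one extreme point flips the Pearson correlation of a perfectly monotone series while the Spearman correlation stays high.

```r
x = 1:20
y = x
cor(x, y)                       # both correlations are exactly 1
y[20] = -100                    # introduce a single extreme outlier
cor(x, y, method = "pearson")   # drops below zero
cor(x, y, method = "spearman")  # remains about 0.71
```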
Let’s drill down on two contrasting pairs of SVs:
QUESTION 5.1 Create two scatter plots: duplications vs inversions, and deletions vs. inversions. Based on question 4, we’d expect the counts of these SVs to be somewhat correlated with each other. Does this hold up? Are there samples that are outliers from the linear trend in these comparisons?
Samples that deviate from a relationship like this might be of particular interest.
ggplot(data = sv, aes(x = duplications, y = inversions)) +
geom_point(fill = "gray", size = 3, color = "black", shape = 21) +
# geom_smooth(method = "lm", color = "black", formula = "y ~ x") +
xlab("Duplications") +
ylab("Inversions") +
theme_pubr() +
theme(text = element_text(size = 15))
ggplot(data = sv, aes(x = deletions, y = inversions)) +
geom_point(fill = "gray", size = 3, color = "black", shape = 21) +
# geom_smooth(method = "lm", color = "black", formula = "y ~ x") +
xlab("Deletions") +
ylab("Inversions") +
theme_pubr() +
theme(text = element_text(size = 15))
For the most part these correlations do not hold up strongly, and there are obvious outliers from the linear trend in both comparisons.
Let’s drill down on the duplications. Look at your plot comparing the number of duplications to the number of inversions. Note that there are three samples that have far more duplications than any other sample, and four samples that have both a large number of inversions and a large (though not the highest) number of duplications. Looking at this plot, we might wonder if there is something special about the three samples that have a lot of duplications but not a lot of inversions.
The mut data frame that you loaded contains one row for each sample and one column for each of 15 genes, with a TRUE value if there is a biallelic inactivation of that gene in that sample.
QUESTION 6.1 We’ll test the hypothesis that tumors with a particular gene inactivation acquire a lot more duplications than tumors lacking that inactivation. Test this hypothesis by performing a Wilcoxon test comparing the number of duplications in tumors with vs. without mutation in each gene. Create a barplot of -log10( p ) and nominate the strongest hit as worthy of further investigation.
test_gene_duplications = function(gene_name = "CHD1", method = "wilcox"){
test_df = tibble(
id = rownames(mut),
mutation = ifelse(mut[[gene_name]], "Mut", "Non-Mut")
) %>%
left_join(mutate(sv, id = rownames(sv)), by = "id") %>%
select(id, mutation, duplications)
if(method == "wilcox"){
wilcox_result = wilcox.test(duplications ~ mutation, data = test_df)
result_df = tibble(
gene_name = gene_name,
beta = wilcox_result$statistic,
p_value = wilcox_result$p.value
)
}else{
ttest_result = t.test(duplications ~ mutation, data = test_df)
result_df = tibble(
gene_name = gene_name,
beta = ttest_result$statistic,
p_value = ttest_result$p.value
)
}
return(result_df)
}
gene_test_result = as.list(names(mut)) %>%
map(test_gene_duplications) %>%
reduce(rbind.data.frame) %>%
arrange(p_value) %>%
mutate(nlogp = -log(p_value, base = 10))
gene_test_result$gene_name = factor(gene_test_result$gene_name, levels = gene_test_result$gene_name)
ggplot(data = gene_test_result, aes(x = gene_name, y = nlogp)) +
geom_bar(stat = "identity") +
xlab("") +
ylab("-log(p-value)") +
theme_pubr() +
theme(text = element_text(size = 15),
axis.text.x = element_text(angle = 45, hjust = 1, face = "italic"))
Now let’s take a close look at deletions. Look at the scatter plot you made comparing inversions vs deletions. There tends to be a linear relationship, even for samples with large numbers of inversions, but there is a set of samples that have lots of deletions but not a lot of inversions.
QUESTION 7.1 You might hypothesize that tumors that have inactivated a DNA repair gene harbor a lot more deletions than normal tumors. Test this hypothesis by performing a Wilcoxon test comparing the number of deletions in tumors with vs. without mutation in each gene. This time, instead of a barplot, create a volcano plot to compare the difference in means (as the effect) vs. the -log10(p) as the statistical strength. Does this analysis nominate any gene as associated with large numbers of deletions?
test_gene_deletions = function(gene_name = "CHD1", method = "wilcox"){
test_df = tibble(
id = rownames(mut),
mutation = ifelse(mut[[gene_name]], "Mut", "Non-Mut")
) %>%
left_join(mutate(sv, id = rownames(sv)), by = "id") %>%
select(id, mutation, deletions)
if(method == "wilcox"){
wilcox_result = wilcox.test(deletions ~ mutation, data = test_df)
w = wilcox_result$statistic  # R reports the Mann-Whitney U statistic
n1 = sum(test_df$mutation == "Mut")
n2 = sum(test_df$mutation == "Non-Mut")
# U has null mean n1*n2/2, so the standardized effect size is:
beta = unname((w - n1 * n2 / 2) / sqrt(n1 * n2 * (n1 + n2 + 1) / 12))
result_df = tibble(
gene_name = gene_name,
beta = beta,
p_value = wilcox_result$p.value
)
}else{
ttest_result = t.test(deletions ~ mutation, data = test_df)
result_df = tibble(
gene_name = gene_name,
beta = ttest_result$statistic,
p_value = ttest_result$p.value
)
}
return(result_df)
}
gene_test_result = as.list(names(mut)) %>%
map(test_gene_deletions) %>%
reduce(rbind.data.frame) %>%
arrange(p_value) %>%
mutate(nlogp = -log(p_value, base = 10))
plt = ggplot(data = gene_test_result, aes(x = beta, y = nlogp, gene_name = gene_name)) +
geom_point(size = 3, shape = 21, color = "black", fill = "gray") +
ggtitle(label = "Volcano plot using Wilcox test") +
xlab("Beta") +
ylab("-log(p-value)") +
theme_pubr() +
theme(text = element_text(size = 15))
ggplotly(plt)
Gene names can be displayed by hovering over the scatter points.
As this figure shows, the analysis nominates genes such as BRCA2 as associated with large numbers of deletions.
The formula for the Z statistic in the Wilcoxon rank-sum test is given by:
\[ Z = \frac{W - \frac{n_1(n_1 + n_2 + 1)}{2}}{\sqrt{\frac{n_1n_2(n_1 + n_2 + 1)}{12}}} \]
where \(W\) is the Wilcoxon rank-sum statistic (the sum of ranks in the first group) and \(n_1\) and \(n_2\) are the sample sizes of the two groups being compared.
This Z statistic provides a standardized effect size for the Wilcoxon rank-sum test, measuring the magnitude of the difference between the two groups. Note that R’s wilcox.test reports the equivalent Mann-Whitney statistic \(U = W - \frac{n_1(n_1+1)}{2}\), whose mean under the null is \(\frac{n_1 n_2}{2}\).
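A quick sketch of this conversion on toy data (the group values are made up): because \(U = W - n_1(n_1+1)/2\), centering R's statistic on \(n_1 n_2 / 2\) yields the same \(Z\) as the rank-sum formula above.

```r
g1 = c(5, 7, 8, 12, 15)            # toy group 1, n1 = 5
g2 = c(1, 2, 3, 4, 6, 9)           # toy group 2, n2 = 6
u  = wilcox.test(g1, g2)$statistic # Mann-Whitney U (here U = 26)
n1 = length(g1); n2 = length(g2)
# U has null mean n1*n2/2; this matches the rank-sum formula
# because W = U + n1*(n1 + 1)/2
z = unname((u - n1 * n2 / 2) / sqrt(n1 * n2 * (n1 + n2 + 1) / 12))
z  # about 2.01
```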
QUESTION 7.2 (BONUS) Re-run the test you performed in problem 7.1, using a t test rather than a Wilcoxon test. Explain why these tests have different performance and produce different results. Discuss the difference between ranking candidates based on P value and based on a combination of P value and effect size.
gene_test_result = as.list(names(mut)) %>%
map(test_gene_deletions, method = "t.test") %>%
reduce(rbind.data.frame) %>%
arrange(p_value) %>%
mutate(nlogp = -log(p_value, base = 10))
plt = ggplot(data = gene_test_result, aes(x = beta, y = nlogp, gene_name = gene_name)) +
geom_point(size = 3, shape = 21, color = "black", fill = "gray") +
ggtitle(label = "Volcano plot using T test") +
xlab("Beta") +
ylab("-log(p-value)") +
theme_pubr() +
theme(text = element_text(size = 15))
ggplotly(plt)
Gene names can be displayed by hovering over the scatter points.
The t test, as a parametric test, can be heavily affected by outliers, small sample sizes, and unequal variances between groups. In this case, the genes with strong negative effect sizes are likely those with outliers, which is why the t test flagged them as significant. The Wilcoxon test attenuates the effect that these outliers have on the overall statistics.